Abstract: We consider an opportunistic spectrum access
(OSA) problem where the time-varying condition of each channel 
(e.g., as a result of random fading or certain primary users' 
activities) is modeled as an arbitrary finite-state Markov chain. 
At each instant of time, a (secondary) user probes a channel
and collects a certain reward as a function of the state of the 
channel (e.g., good channel condition results in higher data rate 
for the user). Each channel has potentially different state space 
and statistics, both unknown to the user, who tries to learn 
which one is the best as it goes and maximizes its usage of 
the best channel. The objective is to construct a good online 
learning algorithm so as to minimize the difference between the 
user's performance in total rewards and that of using the best 
channel (on average) had it known which one is the best from 
a priori knowledge of the channel statistics (also known as the 
regret). This is a classic exploration and exploitation problem 
and results abound when the reward processes are assumed to 
be iid. Compared to prior work, the biggest difference is that 
in our case the reward process is assumed to be Markovian, 
of which iid is a special case. In addition, the reward processes 
are restless in that the channel conditions will continue to evolve 
independently of the user's actions. This leads to a restless bandit
problem for which, to the best of our knowledge, few results exist on
either algorithms or performance bounds in this learning context.
In this paper we introduce an algorithm that utilizes
regenerative cycles of a Markov chain and computes a sample- 
mean based index policy, and show that under mild conditions 
on the state transition probabilities of the Markov chains this 
algorithm achieves logarithmic regret uniformly over time, and 
that this regret bound is also optimal. We numerically examine 
the performance of this algorithm along with a few other learning 
algorithms in the case of an OSA problem with Gilbert-Elliot 
channel models, and discuss how this algorithm may be further 
improved (in terms of its constant) and how this result may lead 
to similar bounds for other algorithms. 

I. Introduction 

In this paper we study the following opportunistic spectrum 
access (OSA) problem. A (secondary) user has access to a set 
of K channels, each of time-varying conditions as a result of 
random fading and/or certain primary users' activities. Each 
channel is thus modeled as an arbitrary finite-state discrete- 
time Markov chain. At each time step, the secondary user 
(simply referred to as the user for the rest of the paper, since there
is no ambiguity) probes a channel to find out its condition,
and is allowed to use the channel in a way consistent with 
its condition. For instance, good channel conditions result in 
higher data rates or lower power for the user and so on. This 
is modeled as a reward collected by the user, the reward being 
a function of the state of the channel or the Markov chain. 



Channels have potentially different state spaces and statistics,
both unknown to the user. The user will thus try to
learn which one is the best and maximize its usage of the
best channel. Within this context, the player's performance is
typically measured by the notion of regret. It is defined as 
the difference between the expected reward that can be gained 
by an "infeasible" or ideal policy, i.e., a policy that requires 
either a priori knowledge of some or all statistics of the arms or 
hindsight information, and the expected reward of the player's 
policy. The most commonly used infeasible policy is the best
single-action policy, which is optimal among all policies that
continue to play the same arm. An ideal policy could play, for
instance, the arm that has the highest expected reward (which
requires statistical information but not hindsight). This type
of regret is sometimes also referred to as the weak regret; see,
e.g., the work by Auer et al. In this paper we will only focus
on this definition of regret.

The above can be cast as a single player multiarmed bandit 
problem, where the reward of each channel (also referred to 
as an arm in the bandit problem literature) is generated by 
a Markov chain with unknown statistics. Furthermore, it is a 
restless bandit problem because the state of each Markov chain 
evolves independent of the action of the user (whether the 
channel is probed or not); by contrast, in a classic multiarmed 
bandit problem the state of a Markov chain only evolves when 
it is acted upon and stays frozen otherwise (also referred to 
as rested). The restless nature of the Markov chains follows 
naturally from the fact that channel conditions are governed by 
external factors like random fading, shadowing, and primary 
user activity. 

In the remainder of this paper a channel will also be referred 
to as an arm, the user as the player, and probing a channel as 
playing or selecting an arm. This problem is a typical example 
of the tradeoff between exploration and exploitation. On the 
one hand, the player needs to sufficiently explore all arms so as 
to discover with accuracy the best arm and avoid getting stuck 
playing an inferior one erroneously believed to be the best. 
On the other hand, he needs to avoid spending too much time 
sampling the arms and collecting statistics and not playing the 
best arm often enough to get a high return. 

In most prior work on the class of multiarmed bandit
problems, originally proposed by Robbins, the rewards
are assumed to be independently drawn from a fixed (but
unknown) distribution. It is worth noting that with this iid
assumption on the reward process, whether an arm is rested
or restless is inconsequential for the following reason. Since
the rewards are independently drawn each time, whether an 
unselected arm remains still or continues to change does 
not affect the reward the arm produces the next time it is 
played whenever that may be. This is clearly not the case 
with Markovian rewards. In the rested case, since the state 
is frozen when an arm is not played, the state in which we 
next observe the arm is independent of how much time elapses 
before we play the arm again. In the restless case, the state of 
an arm continues to evolve, thus the state in which we next 
observe it is now dependent on the amount of time that elapses 
between two plays of the same arm. This makes the problem 
significantly more difficult. 
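
The dependence on elapsed time can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical two-state channel (state 0 = bad, state 1 = good) whose transition probabilities are chosen only for illustration; it prints the probability of finding the channel good again after a gap of k slots, which is the (1, 1) entry of the k-step transition matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def k_step(p, k):
    """Return the k-step transition matrix P^k for k >= 1."""
    result = p
    for _ in range(k - 1):
        result = mat_mul(result, p)
    return result


# Hypothetical two-state channel (illustrative numbers, not from the paper).
P = [[0.9, 0.1],
     [0.2, 0.8]]

for gap in (1, 2, 5, 50):
    prob_good = k_step(P, gap)[1][1]
    print(f"gap={gap:2d}  P(good now | good {gap} slots ago) = {prob_good:.4f}")
```

In the restless case the gap between two plays of an arm is itself decided by the policy, so the law of the next observation depends on the player's past choices; in the rested case the gap is irrelevant because the chain is frozen between plays.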

To the best of our knowledge, there has been no study of 
the restless bandits in this learning context, either in terms 
of algorithms or performance bounds. Here lies the main 
contribution of the present study. In this paper we give the 
first result on the existence of order-optimal policies for the 
above restless bandit problem. Specifically, we introduce an 
algorithm that utilizes regenerative cycles of a Markov chain 
and computes a sample-mean based index policy, and show 
that under mild conditions on the state transition probabilities 
this algorithm achieves logarithmic regret uniformly over time. 

Below we briefly summarize the most relevant results in the
literature. Lai and Robbins model rewards as single-parameter
univariate densities, give a lower bound on the
regret, and construct policies that achieve this lower bound,
which are called asymptotically efficient policies. This result
is extended by Anantharam et al. to the case of playing
more than one arm at a time. Using a similar approach,
Anantharam et al. develop index policies that are
asymptotically efficient for arms with rewards driven by finite,
irreducible, aperiodic and rested Markov chains with identical
state spaces and single-parameter families of stochastic transition
matrices. Agrawal in [6] considers sample-mean based
index policies for the iid model that achieve O(log n) regret,
where n is the total number of plays. Auer et al. also
propose sample-mean based index policies for iid rewards
with bounded support; these are derived from [6], but are
simpler than those in [6] and are not restricted to a specific
family of distributions. These policies achieve logarithmic
regret uniformly over time rather than asymptotically in time,
but have a bigger constant than that in [3]. In [8] it is shown
that the index policy of Auer et al. is order optimal for Markovian
rewards drawn from rested arms but not restricted to single-parameter
families, under some assumptions on the transition
probabilities.

Other works such as [9], [10], [11] consider the iid reward
case in a multiuser setting; players selecting the same arms
experience collision according to a certain collision model.
We would like to mention another class of multiarmed bandit
problems in which the statistics about the problem are known
a priori and the state is observed perfectly; these are thus
optimization problems rather than learning problems. The
rested case is considered by Gittins [12] and the optimal policy
is proved to be an index policy which at each time plays the
arm with highest Gittins' index, while Whittle introduced the
restless bandit problem in [13]. The restless bandit problem
does not have a known general solution though special cases
may be solved. For instance, a myopic policy is shown to
be optimal when channels are identical and bursty in [14]
for an OSA problem formulated as a restless bandit problem
with each channel modeled as a two-state Markov chain (the
Gilbert-Elliot model).

The remainder of this paper is organized as follows. In Section II
we formulate the single player restless bandit problem.
In Section III we introduce an algorithm based on regenerative
cycles that employs sample-mean based indices. The regret of
this algorithm is analyzed and shown to be optimal in Section
IV. In Section V we numerically examine its performance
along with a few other learning algorithms in the case of an
OSA problem with Gilbert-Elliot channel models, and discuss
how this algorithm may be further improved (in terms of its
constant) and how this result may lead to similar bounds for
other algorithms. Finally, Section VI concludes the paper.

II. Problem Formulation and Preliminaries 

Consider K arms (or channels) indexed by $i = 1, 2, \ldots, K$.
The ith arm is modeled as a discrete-time, irreducible and
aperiodic Markov chain with finite state space $S^i$. There is
a stationary and positive reward associated with each state
of each arm. Let $r^i_x$ denote the reward obtained from state
x of arm i, $x \in S^i$; this reward is in general different
for different states. Let $P^i = \{p^i_{xy},\ x, y \in S^i\}$ denote the
transition probability matrix and $\pi^i = \{\pi^i_x,\ x \in S^i\}$ the
stationary distribution of arm i.

Let $(P^i)'$ denote the adjoint of $P^i$ on $\ell_2(\pi^i)$ and let
$\hat{P}^i = (P^i)' P^i$ denote the multiplicative symmetrization of $P^i$, where

$$ (p^i)'_{xy} = \frac{\pi^i_y \, p^i_{yx}}{\pi^i_x}, \quad \forall x, y \in S^i . $$

We will assume that the $P^i$'s are such that the $\hat{P}^i$'s are
irreducible. To give a sense of how strong this assumption
is, we note that one condition that guarantees that the $\hat{P}^i$'s
are irreducible is $p^i_{xx} > 0$, $\forall x \in S^i$, $\forall i$. For the application
under consideration, this condition means that there is always
positive probability for a channel to remain in the same state
over one unit of time, which appears to be a natural and benign
assumption.¹
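
As a concrete illustration of these definitions, the sketch below computes the adjoint, the multiplicative symmetrization, and its eigenvalue gap for a two-state chain. It is a minimal example under stated assumptions: the transition probabilities are hypothetical, and the closed forms used for the stationary distribution and for the second eigenvalue (trace minus one) hold only for two-state stochastic matrices.

```python
def stationary_two_state(p):
    """Stationary distribution of a two-state chain with p[0][1], p[1][0] > 0."""
    a, b = p[0][1], p[1][0]
    return [b / (a + b), a / (a + b)]


def adjoint(p, pi):
    """Adjoint on l2(pi): p'_{xy} = pi_y * p_{yx} / pi_x."""
    n = len(p)
    return [[pi[y] * p[y][x] / pi[x] for y in range(n)] for x in range(n)]


def mult_symmetrization(p, pi):
    """Multiplicative symmetrization P_hat = P' P."""
    pa = adjoint(p, pi)
    n = len(p)
    return [[sum(pa[x][k] * p[k][y] for k in range(n)) for y in range(n)]
            for x in range(n)]


def eigenvalue_gap_two_state(p_hat):
    """A 2x2 stochastic matrix has eigenvalues 1 and (trace - 1)."""
    lambda2 = p_hat[0][0] + p_hat[1][1] - 1.0
    return 1.0 - lambda2


# Hypothetical channel with p_xx > 0 for both states (illustrative numbers).
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = stationary_two_state(P)
P_hat = mult_symmetrization(P, pi)
gap = eigenvalue_gap_two_state(P_hat)
print("pi =", pi)
print("P_hat =", P_hat)
print("eigenvalue gap =", gap)
```

Every entry of `P_hat` is strictly positive here, so the symmetrization is irreducible, in line with the sufficient condition $p^i_{xx} > 0$ stated above.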

We assume the arms (or Markov chains) are mutually
independent and are restless, i.e., their states will continue
to evolve regardless of the user's actions. The mean reward of
arm i, denoted by $\mu^i$, is the expected reward of arm i under
its stationary distribution:

$$ \mu^i = \sum_{x \in S^i} r^i_x \pi^i_x . \tag{1} $$

¹ Alternatively we could adopt a stronger assumption that the Markov chains
are aperiodic and reversible (note that aperiodicity and reversibility imply
that the multiplicative symmetrization of $P^i$ is irreducible), in which case
the same order results can be obtained with a different constant if we use a
different large deviation bound from [15] instead of Lemma 1.

For convenience, we will use * in the superscript to denote the
arm with the highest mean. For instance, $\mu^* = \max_{1 \le i \le K} \mu^i$,
and so on. We assume that the arm with the highest mean is
unique.
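
For a two-state (Gilbert-Elliot style) channel the mean reward and the optimal arm can be computed in closed form. The sketch below is illustrative only: the helper name `mean_reward` and all numerical values are assumptions, not taken from the paper.

```python
def mean_reward(p01, p10, r_bad, r_good):
    """Mean reward of a two-state arm under its stationary distribution.

    p01: probability of moving bad -> good, p10: good -> bad.
    """
    pi_good = p01 / (p01 + p10)
    pi_bad = 1.0 - pi_good
    return r_bad * pi_bad + r_good * pi_good


# Hypothetical arms: (p01, p10, r_bad, r_good).
arms = [(0.1, 0.2, 0.1, 1.0),
        (0.5, 0.1, 0.1, 1.0),
        (0.3, 0.3, 0.1, 1.0)]

means = [mean_reward(*arm) for arm in arms]
best = max(range(len(arms)), key=lambda i: means[i])
print("mean rewards:", means)
print("optimal arm index:", best)
```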

Consistent with the discrete-time Markov chain model, we
will assume that the user's actions occur also in discrete time.
For a policy $\alpha$ we define its regret $R^\alpha(n)$ as the difference
between the expected total reward that can be obtained by
playing the arm with the highest mean and the expected total
reward obtained from using policy $\alpha$ up to time n. Always
playing the arm with the highest mean reward is referred to as
the best single-action policy, and this arm will also be referred
to as the optimal arm; accordingly the others will be referred
to as suboptimal arms.

Let $\alpha(t)$ denote the arm selected by policy $\alpha$ at t, $t = 1, 2, \ldots$,
and $x_{\alpha(t)}$ the state of arm $\alpha(t)$ at time t. Then we
have

$$ R^\alpha(n) = n\mu^* - E^\alpha\!\left[ \sum_{t=1}^{n} r^{\alpha(t)}_{x_{\alpha(t)}} \right] . \tag{2} $$

The objective is to examine how the regret $R^\alpha(n)$ behaves
as a function of n for a given policy $\alpha$ and to construct
a policy whose regret is order-optimal, through appropriate
bounding. As we will show and as is commonly done, the key
to bounding $R^\alpha(n)$ is to bound the expected number of plays
of any suboptimal arm.

Our analysis utilizes the following known results on Markov
chains; the proofs are not reproduced here for brevity. The first
is a result by Lezaud [16] that bounds the probability of a large
deviation from the stationary distribution.

Lemma 1: [Theorem 3.3 from [16]] Consider a finite-state,
irreducible Markov chain $\{X_t\}_{t \ge 1}$ with state space S, matrix
of transition probabilities P, an initial distribution q and
stationary distribution $\pi$. Let $N_q = \left\| \left( q_x / \pi_x,\ x \in S \right) \right\|_2$. Let
$\hat{P} = P'P$ be the multiplicative symmetrization of P where
$P'$ is the adjoint of P on $\ell_2(\pi)$. Let $\epsilon = 1 - \lambda_2$, where $\lambda_2$
is the second largest eigenvalue of the matrix $\hat{P}$. $\epsilon$ will be
referred to as the eigenvalue gap of $\hat{P}$. Let $f : S \to \mathbb{R}$ be
such that $\sum_{y \in S} \pi_y f(y) = 0$, $\|f\|_\infty \le 1$ and $0 < \|f\|_2^2 \le 1$.
If $\hat{P}$ is irreducible, then for any positive integer n and all
$0 < \gamma \le 1$,

$$ P\!\left( \frac{\sum_{t=1}^{n} f(X_t)}{n} \ge \gamma \right) \le N_q \exp\!\left[ -\frac{n \gamma^2 \epsilon}{28} \right] . $$

The second is a result by Bremaud.

Lemma 2: If $\{X_n\}_{n \ge 0}$ is a positive recurrent homogeneous
Markov chain with state space S, stationary distribution $\pi$, and
$\tau$ is a stopping time that is finite almost surely for which
$X_\tau = x$, then for all $y \in S$,

$$ E\!\left[ \sum_{t=0}^{\tau - 1} I(X_t = y) \,\Big|\, X_0 = x \right] = E[\tau \mid X_0 = x]\, \pi_y . $$
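
Lemma 2 can be checked by hand on a two-state chain, taking the stopping time to be the first return to state 0. The sketch below evaluates both sides exactly; the function name and the numbers are hypothetical, and the closed forms rely on the chain having only two states (the excursion away from state 0 is a geometric sojourn in state 1).

```python
def lemma2_two_state(p01, p10):
    """Compare both sides of Lemma 2 for tau = first return time to state 0.

    Returns a list of (expected visits to y before tau, E[tau] * pi_y)
    for y = 0 and y = 1, with X_0 = 0.
    """
    pi0 = p10 / (p01 + p10)
    pi1 = p01 / (p01 + p10)
    # One step is always taken; with probability p01 the chain moves to
    # state 1 and stays there for a geometric time with mean 1 / p10.
    expected_tau = 1.0 + p01 / p10
    visits_to_0 = 1.0          # only the slot t = 0 is spent in state 0
    visits_to_1 = p01 / p10    # expected sojourn in state 1 before returning
    return [(visits_to_0, expected_tau * pi0),
            (visits_to_1, expected_tau * pi1)]


for lhs, rhs in lemma2_two_state(0.1, 0.2):
    print(f"expected visits = {lhs:.4f},  E[tau] * pi_y = {rhs:.4f}")
```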



In the next two sections we first present a policy, referred 
to as the regenerative cycle algorithm, and then analyze its 
regret. 

III. Regenerative Cycle Algorithm (RCA) 

In this section we present an algorithm called the regener- 
ative cycle algorithm (RCA), and prove in the next section 
that this algorithm guarantees a logarithmic growth of the 
regret uniformly over time under mild assumptions on the state 
transition probabilities and the rewards. 

As the name suggests, this algorithm operates on regenera- 
tive cycles. In essence what the algorithm does is to construct 
sample paths of each arm solely using those observed within 
regenerative cycles while discarding the rest in its estimation 
of the quality of an arm (in the form of an index). The reason 
behind such a construction has to do with the restless nature 
of the arms. As noted in the introduction, since each arm 
continues to evolve according to the Markov chain regardless 
of the user's action, the probability distribution of the reward 
we get by playing an arm is a function of the amount of time 
that has elapsed since the last time we played the same arm. 
Since we play one arm at a time, the arms become coupled (in 
terms of the probability distributions of the rewards). While 
this certainly does not affect our ability to collect rewards, it 
makes it extremely hard to analyze the estimated quality (or 
the index) of an arm calculated based on rewards collected 
this way. 

However, if instead of the actual sample path of all ob- 
servations from an arm, we limit ourselves to a sample 
path constructed (or rather stitched together) using only the 
observations from regenerative cycles, then this sample path 
essentially has the same statistics as the original Markov chain 
due to the renewal property and one can now use the sample 
mean of the rewards from the regenerative sample paths to 
approximate the mean reward under stationary distribution. 

Figure 1 illustrates one possible realization of this algorithm.
As shown, RCA operates in blocks. Within a block, the
algorithm plays the same arm in each time slot (arm i in the
first block in this example) till a certain pre-specified state (say
$\gamma^i$) is observed. Upon this observation we enter a regenerative
cycle and continue to play till the same state $\gamma^i$ is observed
a second time. This marks the end of the block labeled "play
arm i". At the end of each block, the algorithm computes an
index for all arms and selects the one with the highest index to
play in the next block (arm k shown in the figure). It follows
that the block length is a random variable.

[Figure 1: timeline of consecutive blocks ("play arm i", "play arm k", "play arm j"), each divided into SB1, SB2 and SB3, with "compute index" marked at the end of every block.]

Fig. 1. Example realization of RCA

For the purpose of index computation and subsequent
analysis, each block is further broken into three sub-blocks
(SBs). SB1 consists of all time slots from the beginning of
the block to right before the first visit to $\gamma^i$; SB2 includes all
time slots from the first visit to $\gamma^i$ up to but excluding the
second visit to state $\gamma^i$; SB3 consists of a single time slot
with the second visit to $\gamma^i$. These are also shown in Figure 1.
The key to the algorithm is for each arm to single out only
observations within SB2's in each block and virtually assemble
them (these are highlighted with thick lines). Because of the
regenerative nature of the Markov chain, once put together,
the resulting sample path has exactly the same statistics as
given by the transition probability matrix $P^i$; this results in a
tractable problem.
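
The sub-block decomposition described above can be sketched as a small helper that splits the states observed in one completed block. The function name `split_block` and the integer state labels are hypothetical and serve only to make the definition concrete.

```python
def split_block(states, gamma):
    """Split the observed states of one completed block into (SB1, SB2, SB3).

    SB1: slots before the first visit to gamma (possibly empty).
    SB2: from the first visit to gamma up to, but excluding, the second visit.
    SB3: the single slot holding the second visit to gamma.
    """
    first = states.index(gamma)
    second = states.index(gamma, first + 1)
    if second != len(states) - 1:
        raise ValueError("a block must end at the second visit to gamma")
    return states[:first], states[first:second], states[second:]


# Hypothetical observations from one block of an arm with gamma = 1.
sb1, sb2, sb3 = split_block([2, 0, 1, 2, 2, 0, 1], gamma=1)
print("SB1:", sb1)
print("SB2:", sb2)
print("SB3:", sb3)
```

Only the SB2 part enters the index; concatenating the SB2 parts of successive blocks of the same arm gives a path that always restarts from the regenerative state.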

Throughout our discussion, we will consider a horizon of n
time slots. A list of notations used is summarized as follows;
some are also marked on Figure 2 for convenience:

[Figure 2: timeline marking block b(n), slot T(n), and slot n.]

Fig. 2. Running RCA over a period of n slots

• $\gamma^i$: state that determines the regenerative cycles for arm i.

• $\alpha(b)$: the arm played in block b.

• $b(n)$: total number of completed blocks up to time n.

• $T(n)$: time at the end of the last completed block.

• $T^i(n)$: total number of times (slots) arm i is played up
to time T(n).

• $B^i(b)$: total number of blocks within the first b blocks in
which arm i is played.

• $X^i_1(j)$: vector of observed states from SB1 of the jth
block in which arm i is played; it is empty if the first
observed state is $\gamma^i$.

• $X^i_2(j)$: vector of observed states from SB2 of the jth
block in which arm i is played.

• $X^i(j)$: vector of observed states from the jth block
in which arm i is played. Thus we have
$X^i(j) = [X^i_1(j), X^i_2(j), \gamma^i]$.

• $t(b)$: time at the end of block b;
$t(b) = \sum_{i=1}^{K} \sum_{j=1}^{B^i(b)} |X^i(j)|$.

• $T^i(t(b))$: total number of time slots arm i is played up
to time t(b). Thus $T^i(t(b)) = \sum_{j=1}^{B^i(b)} |X^i(j)|$. Also note
that $T^i(t(b(n))) = T^i(n)$.

• $t_2(b)$: total number of time slots spent in SB2 up to block
b. Thus $t_2(b) = \sum_{i=1}^{K} \sum_{j=1}^{B^i(b)} |X^i_2(j)|$.

• $r^i(k)$: the reward from arm i when it is played for the kth
time, counting only those times played during an SB2.

• $T^i_2(t_2(b))$: total number of time slots arm i is played during
SB2 up to block b. Thus $T^i_2(t_2(b)) = \sum_{j=1}^{B^i(b)} |X^i_2(j)|$.

RCA computes and updates the value of an index $g^i$ for
each arm i at the end of block b, based on the total reward
obtained from arm i during all SB2s, as follows:

$$ g^i_{t_2(b),\, T^i_2(t_2(b))} = \bar{r}^i(T^i_2(t_2(b))) + \sqrt{\frac{L \ln t_2(b)}{T^i_2(t_2(b))}} , \tag{3} $$

where L is a constant, and

$$ \bar{r}^i(T^i_2(t_2(b))) = \frac{r^i(1) + r^i(2) + \cdots + r^i(T^i_2(t_2(b)))}{T^i_2(t_2(b))} $$

denotes the sample mean of the reward collected during
the SB2s $X^i_2(1), \ldots, X^i_2(B^i(b))$ (this is arm i's total SB2
reward over the total number of times it is played in SB2s). The
second term in the index computation serves the purpose of
exploration: the relative uncertainty of the mean reward of an
arm grows as the arm is not played. This index definition is
similar to that proposed by Auer et al., but computed only over SB2s.
RCA is formally given in Figure 3. In this description the
algorithm continues indefinitely, but can obviously be stopped
at any time that some desired horizon is reached.

Regenerative Cycle Algorithm (RCA):

    Initialize: b = 1, t = 0, t_2 = 0, T_2^i = 0, r^i = 0, for all i = 1, ..., K
    for b ≤ K do
        play arm b; set γ^b to be the first state observed
        t := t + 1; t_2 := t_2 + 1; T_2^b := T_2^b + 1; r^b := r^b + r^b_{γ^b}
        play arm b; denote observed state as x
        while x ≠ γ^b do
            t := t + 1; t_2 := t_2 + 1; T_2^b := T_2^b + 1; r^b := r^b + r^b_x
            play arm b; denote observed state as x
        end while
        b := b + 1; t := t + 1
    end for
    for j = 1 to K do
        compute index g^j := r^j / T_2^j + sqrt(L ln t_2 / T_2^j)
    end for
    i := argmax_j g^j
    while (1) do
        play arm i; denote observed state as x
        while x ≠ γ^i do
            t := t + 1
            play arm i; denote observed state as x
        end while
        t := t + 1; t_2 := t_2 + 1; T_2^i := T_2^i + 1; r^i := r^i + r^i_x
        play arm i; denote observed state as x
        while x ≠ γ^i do
            t := t + 1; t_2 := t_2 + 1; T_2^i := T_2^i + 1; r^i := r^i + r^i_x
            play arm i; denote observed state as x
        end while
        b := b + 1; t := t + 1
        for j = 1 to K do
            compute index g^j := r^j / T_2^j + sqrt(L ln t_2 / T_2^j)
        end for
        i := argmax_j g^j
    end while

Fig. 3. Pseudocode of RCA

It is worth noting that RCA also collects reward during SB1
and SB3. However, the computation of the indices only relies 
on SB2. The reason becomes clearer in the next section 
where we analyze its regret and show that it grows at most 
logarithmically in n. 
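
To make the block structure concrete, the following is a minimal simulation sketch of RCA on restless two-state channels. It is an illustration under stated assumptions, not the authors' implementation: the function name `rca`, the data layout, the example channels, and the value L = 2 (much smaller than the constant required by the analysis in the next section) are all hypothetical choices.

```python
import math
import random


def rca(chains, rewards, horizon, L, seed=0):
    """Run RCA for `horizon` slots; return (total_reward, plays_per_arm).

    chains[i][x][y]: transition probability of arm i from state x to y.
    rewards[i][x]:   reward of arm i in state x.
    """
    rng = random.Random(seed)
    K = len(chains)
    states = [rng.randrange(len(c)) for c in chains]  # current state of each arm
    gamma = [None] * K       # regenerative state of each arm
    sb2_reward = [0.0] * K   # reward collected by each arm during SB2 slots
    sb2_plays = [0] * K      # T_2^i: number of SB2 slots of each arm
    plays = [0] * K          # all slots in which each arm was played
    t2 = 0                   # total number of SB2 slots
    t = 0                    # elapsed slots
    total = 0.0              # reward collected over SB1, SB2 and SB3

    def step(arm):
        """Play `arm` for one slot; every chain then evolves (restless arms)."""
        nonlocal t, total
        x = states[arm]
        total += rewards[arm][x]
        plays[arm] += 1
        t += 1
        for i, chain in enumerate(chains):
            states[i] = rng.choices(range(len(chain)), weights=chain[states[i]])[0]
        return x

    def play_block(arm):
        """Play one block of `arm`; return False once the horizon is reached."""
        nonlocal t2
        in_sb2 = False
        while t < horizon:
            x = step(arm)
            if gamma[arm] is None:
                gamma[arm] = x          # first state ever observed on this arm
            if x == gamma[arm]:
                if in_sb2:
                    return True         # SB3: second visit closes the block
                in_sb2 = True           # first visit opens SB2
            if in_sb2:
                sb2_reward[arm] += rewards[arm][x]
                sb2_plays[arm] += 1
                t2 += 1
        return False

    def index(arm):
        """Sample mean over SB2 slots plus the exploration term of (3)."""
        return (sb2_reward[arm] / sb2_plays[arm]
                + math.sqrt(L * math.log(t2) / sb2_plays[arm]))

    for arm in range(K):                # one initial block per arm
        if not play_block(arm):
            return total, plays
    while play_block(max(range(K), key=index)):
        pass
    return total, plays


# Hypothetical example: two channels with rewards 0.1 (bad) and 1.0 (good).
chains = [[[0.9, 0.1], [0.2, 0.8]],
          [[0.5, 0.5], [0.1, 0.9]]]
rewards = [[0.1, 1.0], [0.1, 1.0]]
total, plays = rca(chains, rewards, horizon=20000, L=2.0, seed=1)
print("total reward:", total)
print("plays per arm:", plays)
```

The sketch keeps two separate tallies: `total` counts reward from every slot, while `sb2_reward` and `sb2_plays` count only SB2 slots, which are the only observations that enter the index.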

IV. Regret Analysis of RCA

We begin by bounding the expected number of plays of a 
suboptimal arm. 

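Theorem 1: Assume all arms are finite-state, irreducible,
aperiodic Markov chains whose transition probability matrices
have irreducible multiplicative symmetrizations and
assume all rewards are positive. Let
$\pi^i_{\min} = \min_{x \in S^i} \pi^i_x$,
$\pi_{\min} = \min_{1 \le i \le K} \pi^i_{\min}$,
$r_{\max} = \max_{x \in S^i, 1 \le i \le K} r^i_x$,
$S_{\max} = \max_{1 \le i \le K} |S^i|$,
$\hat{\pi}_{\max} = \max_{x \in S^i, 1 \le i \le K} \{\pi^i_x, 1 - \pi^i_x\}$,
$\epsilon_{\min} = \min_{1 \le i \le K} \epsilon^i$,
$M^i_{\max} = \max_{x, y \in S^i, x \ne y} M^i_{x,y}$, where
$\epsilon^i$ is the eigenvalue gap of the multiplicative symmetrization
of the transition probability matrix of the ith arm and $M^i_{x,y}$ is
the mean hitting time of state y starting from an initial state x
for the ith arm. Then for a player using RCA with a constant
$L \ge 112 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max} / \epsilon_{\min}$ in (3), we have

$$ \sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) E[T^i(n)] \le 4L \sum_{i: \mu^i < \mu^*} \frac{D_i \ln n}{\mu^* - \mu^i} + \sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) D_i C_i , $$

where

$$ C_i = 1 + \frac{(|S^i| + |S^*|)\beta}{\pi_{\min}}, \qquad D_i = \frac{1}{\pi^i_{\min}} + M^i_{\max} + 1, \qquad \beta = \sum_{t=1}^{\infty} t^{-2} . $$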
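
To get a feel for the size of the constant L required by Theorem 1, the sketch below evaluates the threshold for two-state arms. All names and numbers are hypothetical. It relies on two facts specific to two-state chains: they are reversible, so the multiplicative symmetrization equals $P^2$ and its eigenvalue gap is $1 - (1 - p_{01} - p_{10})^2$, and $\max_x \{\pi_x, 1 - \pi_x\}$ reduces to the larger stationary probability.

```python
def rca_L_threshold(arms, r_max):
    """Smallest L allowed by Theorem 1 for two-state arms.

    arms: list of (p01, p10) transition probabilities, all in (0, 1).
    r_max: largest reward over all arms and states.
    """
    s_max = 2
    pi_hat_max = 0.0
    eps_min = float("inf")
    for p01, p10 in arms:
        pi1 = p01 / (p01 + p10)
        pi_hat_max = max(pi_hat_max, pi1, 1.0 - pi1)
        # Two-state chains are reversible, so P_hat = P^2 and its second
        # eigenvalue is the square of that of P, namely (1 - p01 - p10)^2.
        eps = 1.0 - (1.0 - p01 - p10) ** 2
        eps_min = min(eps_min, eps)
    return 112.0 * s_max ** 2 * r_max ** 2 * pi_hat_max ** 2 / eps_min


# Hypothetical pair of channels with rewards bounded by 1.
L_min = rca_L_threshold([(0.1, 0.2), (0.5, 0.1)], r_max=1.0)
print("Theorem 1 requires L >=", L_min)
```

The threshold is in the hundreds even for this small example, which is consistent with the remark that the constant of the bound leaves room for improvement.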



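Proof: Throughout the proof all quantities pertain to
RCA, which will be denoted by $\alpha$ and suppressed from
the superscript whenever there is no ambiguity. Let
$c_{t,s} = \sqrt{L \ln t / s}$, and let l be any positive integer. Then,

\begin{align*}
B^i(b) &= 1 + \sum_{m=K+1}^{b} I(\alpha(m) = i) \\
&\le l + \sum_{m=K+1}^{b} I\big(\alpha(m) = i,\ B^i(m-1) \ge l\big) \\
&\le l + \sum_{m=K+1}^{b} I\big( g^*_{t_2(m-1),\, T^*_2(t_2(m-1))} \le g^i_{t_2(m-1),\, T^i_2(t_2(m-1))},\ B^i(m-1) \ge l \big) \\
&\le l + \sum_{m=K+1}^{b} I\Big( \min_{1 \le s \le t_2(m-1)} g^*_{t_2(m-1),\, s} \le \max_{t_2(l) \le s_i \le t_2(m-1)} g^i_{t_2(m-1),\, s_i} \Big) \\
&\le l + \sum_{m=K+1}^{b} \sum_{s=1}^{t_2(m-1)} \sum_{s_i = t_2(l)}^{t_2(m-1)} I\big( g^*_{t_2(m-1),\, s} \le g^i_{t_2(m-1),\, s_i} \big) \tag{4} \\
&\le l + \sum_{t=1}^{t_2(b)} \sum_{s=1}^{t-1} \sum_{s_i = l}^{t-1} I\big( g^*_{t,s} \le g^i_{t,s_i} \big) , \tag{5}
\end{align*}

where, as given in (3), $g^i_{t,s} = \bar{r}^i(s) + c_{t,s}$. The inequality in
(5) follows from the fact that the outer sum in (5) is over time
while the outer sum in (4) is over blocks and each block lasts
at least two time slots.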

We now show that $g^*_{t,s} \le g^i_{t,s_i}$ implies that at least one of
the following holds:

$$ \bar{r}^*(s) \le \mu^* - c_{t,s} \tag{6} $$
$$ \bar{r}^i(s_i) \ge \mu^i + c_{t,s_i} \tag{7} $$
$$ \mu^* < \mu^i + 2 c_{t,s_i} . \tag{8} $$

This is because if none of the above holds, then we must have

$$ g^*_{t,s} = \bar{r}^*(s) + c_{t,s} > \mu^* \ge \mu^i + 2 c_{t,s_i} > \bar{r}^i(s_i) + c_{t,s_i} = g^i_{t,s_i} , $$

which contradicts $g^*_{t,s} \le g^i_{t,s_i}$.

If we choose $s_i \ge 4L \ln(t_2(b)) / (\mu^* - \mu^i)^2$, then
$2 c_{t,s_i} \le \mu^* - \mu^i$ for $t \le t_2(b)$, which means (8) is false, and therefore
at least one of (6) and (7) is true with this choice of $s_i$.
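
The step behind this choice of $s_i$ can be written out explicitly. The short derivation below is a sketch that uses only the definition $c_{t,s} = \sqrt{L \ln t / s}$, which is increasing in t and decreasing in s, and assumes $t_2(b) > 1$ so that the logarithm is positive.

```latex
\begin{align*}
s_i \ge \frac{4L \ln t_2(b)}{(\mu^* - \mu^i)^2},\quad t \le t_2(b)
\;\Longrightarrow\;
2 c_{t,s_i} = 2\sqrt{\frac{L \ln t}{s_i}}
\le 2\sqrt{\frac{L \ln t_2(b)\,(\mu^* - \mu^i)^2}{4L \ln t_2(b)}}
= \mu^* - \mu^i .
\end{align*}
```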
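We next take $l = \left\lceil \frac{4L \ln t_2(b)}{(\mu^* - \mu^i)^2} \right\rceil$, and proceed from (5). Taking
expectation on both sides and relaxing the outer sum in (5)
from $t_2(b)$ to $\infty$,

\begin{align*}
E[B^i(b)] \le \left\lceil \frac{4L \ln t_2(b)}{(\mu^* - \mu^i)^2} \right\rceil
&+ \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = l}^{t-1} P\big( \bar{r}^i(s_i) \ge \mu^i + c_{t,s_i} \big) \\
&+ \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i = l}^{t-1} P\big( \bar{r}^*(s) \le \mu^* - c_{t,s} \big) .
\end{align*}

Consider an initial distribution $q^i$ for the ith arm. We have:

$$ N_{q^i} = \left\| \left( \frac{q^i_y}{\pi^i_y},\ y \in S^i \right) \right\|_2 \le \sum_{y \in S^i} \left\| \frac{q^i_y}{\pi^i_y} \right\|_2 \le \frac{1}{\pi_{\min}} , $$

where the first inequality follows from the Minkowski inequality.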
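Let $n^i_y(t)$ denote the number of times state y of arm i is
observed during all SB2s up to the tth play. Then,

\begin{align*}
P\big( \bar{r}^i(s_i) \ge \mu^i + c_{t,s_i} \big)
&= P\Big( \sum_{y \in S^i} r^i_y n^i_y(s_i) \ge s_i \sum_{y \in S^i} r^i_y \pi^i_y + s_i c_{t,s_i} \Big) \\
&= P\Big( \sum_{y \in S^i} \big( r^i_y n^i_y(s_i) - r^i_y s_i \pi^i_y \big) \ge s_i c_{t,s_i} \Big) \\
&= P\Big( \sum_{y \in S^i} \big( -r^i_y n^i_y(s_i) + r^i_y s_i \pi^i_y \big) \le -s_i c_{t,s_i} \Big) . \tag{9}
\end{align*}

Consider a sample path $\omega$ and the events

$$ A = \Big\{ \omega : \sum_{y \in S^i} \big( -r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y \big) \le -s_i c_{t,s_i} \Big\} , $$

$$ B = \bigcup_{y \in S^i} \Big\{ \omega : -r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y \le -\frac{s_i c_{t,s_i}}{|S^i|} \Big\} . $$

If $\omega \notin B$ then

$$ -r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y > -\frac{s_i c_{t,s_i}}{|S^i|}, \quad \forall y \in S^i , $$

and summing over $y \in S^i$ gives $\sum_{y \in S^i} ( -r^i_y n^i_y(s_i)(\omega) + r^i_y s_i \pi^i_y ) > -s_i c_{t,s_i}$.
Thus $\omega \notin A$, and $P(A) \le P(B)$. Then continuing from (9):

\begin{align*}
P\big( \bar{r}^i(s_i) \ge \mu^i + c_{t,s_i} \big)
&\le \sum_{y \in S^i} P\Big( -r^i_y n^i_y(s_i) + r^i_y s_i \pi^i_y \le -\frac{s_i c_{t,s_i}}{|S^i|} \Big) \\
&\le \sum_{y \in S^i} N_{q^i}\, t^{-\frac{L \epsilon^i}{28 (|S^i| r^i_y \hat{\pi}^i_y)^2}} \tag{10} \\
&\le \frac{|S^i|}{\pi_{\min}}\, t^{-\frac{L \epsilon_{\min}}{28 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max}}} , \tag{11}
\end{align*}

where (10) follows from letting

$$ \gamma = \frac{c_{t,s_i}}{|S^i| r^i_y \hat{\pi}^i_y}, \qquad f(X_t) = \frac{I(X_t = y) - \pi^i_y}{\hat{\pi}^i_y}, \qquad \hat{\pi}^i_y = \max\{\pi^i_y, 1 - \pi^i_y\} , $$

and using Lemma 1 (note $\hat{P}^i$ is irreducible), which gives

$$ P\Big( \frac{1}{s_i} \sum_{t=1}^{s_i} f(X_t) \ge \gamma \Big) \le N_{q^i} \exp\Big[ -\frac{s_i \gamma^2 \epsilon^i}{28} \Big] = N_{q^i}\, t^{-\frac{L \epsilon^i}{28 (|S^i| r^i_y \hat{\pi}^i_y)^2}} . \tag{12} $$

We note that for $\gamma > 1$ the deviation probability is zero so the
bound still holds.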
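Similarly, we have

\begin{align*}
P\big( \bar{r}^*(s) \le \mu^* - c_{t,s} \big)
&= P\Big( \sum_{y \in S^*} r^*_y \big( n^*_y(s) - s \pi^*_y \big) \le -s c_{t,s} \Big) \\
&\le \sum_{y \in S^*} P\Big( r^*_y \big( n^*_y(s) - s \pi^*_y \big) \le -\frac{s c_{t,s}}{|S^*|} \Big) \\
&= \sum_{y \in S^*} P\Big( r^*_y \Big( s - \sum_{x \ne y} n^*_x(s) \Big) - r^*_y s \Big( 1 - \sum_{x \ne y} \pi^*_x \Big) \le -\frac{s c_{t,s}}{|S^*|} \Big) \\
&= \sum_{y \in S^*} P\Big( r^*_y \sum_{x \ne y} n^*_x(s) - r^*_y s \sum_{x \ne y} \pi^*_x \ge \frac{s c_{t,s}}{|S^*|} \Big) \\
&\le \sum_{y \in S^*} N_{q^*}\, t^{-\frac{L \epsilon^*}{28 (|S^*| r^*_y \hat{\pi}^*_y)^2}} \tag{13} \\
&\le \frac{|S^*|}{\pi_{\min}}\, t^{-\frac{L \epsilon_{\min}}{28 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max}}} , \tag{14}
\end{align*}

where (13) again follows from Lemma 1. Since

$$ \frac{|S^i| + |S^*|}{\pi_{\min}} \sum_{t=1}^{\infty} \sum_{s=1}^{t-1} \sum_{s_i=1}^{t-1} t^{-\frac{L \epsilon_{\min}}{28 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max}}} \le \frac{|S^i| + |S^*|}{\pi_{\min}} \sum_{t=1}^{\infty} t^{-2} , \tag{15} $$

from (11) and (14), given $b(n) = b$ we have

$$ E[B^i(b(n)) \mid b(n) = b] \le \frac{4L \ln t_2(b)}{(\mu^* - \mu^i)^2} + 1 + \frac{(|S^i| + |S^*|)\beta}{\pi_{\min}} $$

for all suboptimal arms. The inequality in (15) follows from
the assumption $L \ge 112 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max} / \epsilon_{\min}$, which makes the
exponent of t at most $-4$, while the two inner sums contribute
fewer than $t^2$ terms. Therefore,

$$ E[B^i(b(n))] \le \frac{4L \ln n}{(\mu^* - \mu^i)^2} + 1 + \frac{(|S^i| + |S^*|)\beta}{\pi_{\min}} \tag{16} $$

since $n \ge t_2(b(n))$ almost surely.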

Note that all the quantities used in computing the indices
and the probabilities above come from the intervals
$X^i_2(1), X^i_2(2), \ldots$, $\forall i \in \{1, \ldots, K\}$. Since these intervals
begin with state $\gamma^i$ and end with a return to $\gamma^i$, by the
strong Markov property the process at these stopping times
has the same distribution as the original process. Moreover,
by connecting these intervals together we form a continuous
sample path which can be viewed as a sample path generated
by a Markov chain with a transition matrix identical to that of the
original arm. This is the reason why we can apply Lezaud's
bound to this Markov chain.

The total number of plays of arm i at the end of block
b(n) is equal to the total number of plays of arm i during the
regenerative cycles of visiting state $\gamma^i$ plus the total number
of plays before entering the regenerative cycles plus one more
play resulting from the last play of the block, which is state
$\gamma^i$. This gives:

$$ E[T^i(n)] \le \left( \frac{1}{\pi^i_{\min}} + M^i_{\max} + 1 \right) E[B^i(b(n))] . $$

Thus,

$$ \sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) E[T^i(n)] \le 4L \sum_{i: \mu^i < \mu^*} \frac{D_i \ln n}{\mu^* - \mu^i} + \sum_{i: \mu^i < \mu^*} (\mu^* - \mu^i) D_i C_i . $$



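We now state the main theorem of this paper.

Theorem 2: Assume all arms are finite-state, irreducible,
aperiodic Markov chains whose transition probability matrices
have irreducible multiplicative symmetrizations and
assume all rewards are positive. Let
$\pi^i_{\min} = \min_{x \in S^i} \pi^i_x$,
$\pi_{\min} = \min_{1 \le i \le K} \pi^i_{\min}$,
$r_{\max} = \max_{x \in S^i, 1 \le i \le K} r^i_x$,
$S_{\max} = \max_{1 \le i \le K} |S^i|$,
$\hat{\pi}_{\max} = \max_{x \in S^i, 1 \le i \le K} \{\pi^i_x, 1 - \pi^i_x\}$,
$\epsilon_{\min} = \min_{1 \le i \le K} \epsilon^i$,
$M^i_{\max} = \max_{x, y \in S^i, x \ne y} M^i_{x,y}$, where
$\epsilon^i$ is the eigenvalue gap of the multiplicative symmetrization
of the transition probability matrix of the ith arm and
$M^i_{x,y}$ is the mean hitting time of state y starting from
an initial state x for the ith arm. Then using a constant
$L \ge 112 S^2_{\max} r^2_{\max} \hat{\pi}^2_{\max} / \epsilon_{\min}$, the regret of RCA can be upper
bounded uniformly over time by the following, $\forall n$:

$$ R^{RCA}(n) < 4L \ln n \sum_{i: \mu^i < \mu^*} \frac{1}{\mu^* - \mu^i} \left( D_i + \frac{E_i}{\mu^* - \mu^i} \right) + \sum_{i: \mu^i < \mu^*} C_i \big( (\mu^* - \mu^i) D_i + E_i \big) + F , $$

where

$$ C_i = 1 + \frac{(|S^i| + |S^*|)\beta}{\pi_{\min}}, \qquad \beta = \sum_{t=1}^{\infty} t^{-2}, $$
$$ D_i = \frac{1}{\pi^i_{\min}} + M^i_{\max} + 1, $$
$$ E_i = \mu^i (1 + M^i_{\max}) + \mu^* M^*_{\max}, $$
$$ F = \mu^* \left( \frac{1}{\pi_{\min}} + \max_{i \in \{1, \ldots, K\}} M^i_{\max} + 1 \right) . $$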



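Proof: Assume that the states which determine the regenerative
sample paths are given a priori by $\gamma = [\gamma^1, \ldots, \gamma^K]$.
We denote the expectations with respect to RCA given $\gamma$ as
$E_\gamma$. First we rewrite the regret in the following form:

\begin{align*}
R_\gamma(n) &= \mu^* E_\gamma[T(n)] - E_\gamma\Big[ \sum_{t=1}^{T(n)} r^{\alpha(t)}_{x_{\alpha(t)}} \Big]
+ \mu^* E_\gamma[n - T(n)] - E_\gamma\Big[ \sum_{t=T(n)+1}^{n} r^{\alpha(t)}_{x_{\alpha(t)}} \Big] \\
&= \Big\{ \mu^* E_\gamma[T(n)] - \sum_{i=1}^{K} \mu^i E_\gamma[T^i(n)] \Big\} - Z_\gamma(n)
+ \mu^* E_\gamma[n - T(n)] - E_\gamma\Big[ \sum_{t=T(n)+1}^{n} r^{\alpha(t)}_{x_{\alpha(t)}} \Big] , \tag{18}
\end{align*}

where for notational convenience, we have used

$$ Z_\gamma(n) = E_\gamma\Big[ \sum_{t=1}^{T(n)} r^{\alpha(t)}_{x_{\alpha(t)}} \Big] - \sum_{i=1}^{K} \mu^i E_\gamma[T^i(n)] . $$

We can bound the first difference in (18) logarithmically
using Theorem 1, so it remains to bound $Z_\gamma(n)$ and the last
difference. We have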

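\begin{align*}
Z_\gamma(n) \ge{} & \sum_{y \in S^*} r^*_y E_\gamma\Big[ \sum_{j=1}^{B^*(b(n))} \sum_{X_t \in X^*(j)} I(X_t = y) \Big]
+ \sum_{i: \mu^i < \mu^*} \sum_{y \in S^i} r^i_y E_\gamma\Big[ \sum_{j=1}^{B^i(b(n))} \sum_{X_t \in X^i_2(j)} I(X_t = y) \Big] \\
& - \mu^* E_\gamma[T^*(n)] - \sum_{i: \mu^i < \mu^*} \mu^i \Big( \frac{1}{\pi^i_{\gamma^i}} + M^i_{\max} + 1 \Big) E_\gamma[B^i(b(n))] , \tag{19}
\end{align*}

where the inequality comes from counting only the rewards
obtained during the SB2s for all suboptimal arms. Applying
Lemma 2 to (19) we get

$$ E_\gamma\Big[ \sum_{j=1}^{B^i(b(n))} \sum_{X_t \in X^i_2(j)} I(X_t = y) \Big] = \frac{\pi^i_y}{\pi^i_{\gamma^i}} E_\gamma[B^i(b(n))] . $$

Rearranging terms and noting $\mu^i = \sum_{y \in S^i} r^i_y \pi^i_y$, we get

$$ Z_\gamma(n) \ge R^*(n) - \sum_{i: \mu^i < \mu^*} \mu^i (M^i_{\max} + 1) E_\gamma[B^i(b(n))] , \tag{20} $$

where

$$ R^*(n) = \sum_{y \in S^*} r^*_y E_\gamma\Big[ \sum_{j=1}^{B^*(b(n))} \sum_{X_t \in X^*(j)} I(X_t = y) \Big] - \mu^* E_\gamma[T^*(n)] . $$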

Consider now $R^*(n)$. Since all suboptimal arms are played
at most logarithmically, the number of time steps in which
the best arm is not played is at most logarithmic. It follows
that the number of discontinuities between plays of the best
arm is at most logarithmic. Suppose we combine successive
blocks in which the best arm is played, and denote by $\bar{X}^*(j)$
the jth combined block. Let $b^*$ denote the total number of
combined blocks up to block b. Each $\bar{X}^*$ thus consists of two
sub-blocks: $\bar{X}^*_1$, which contains the states visited from the beginning
of $\bar{X}^*$ (empty if the first state is $\gamma^*$) to the state right before
hitting $\gamma^*$, and $\bar{X}^*_2$, which contains the rest of $\bar{X}^*$ (a
random number of regenerative cycles).

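Since a combined block $\bar{X}^*$ starts after a discontinuity in playing the
best arm, $b^*(n)$ is less than or equal to the total number of
completed blocks in which the best arm is not played up to
time n. Thus

$$ E_\gamma[b^*(n)] \le \sum_{i: \mu^i < \mu^*} E_\gamma[B^i(b(n))] . \tag{21} $$

We rewrite $R^*(n)$ in the following form:

\begin{align*}
R^*(n) ={} & \sum_{y \in S^*} r^*_y E_\gamma\Big[ \sum_{j=1}^{b^*(n)} \sum_{X_t \in \bar{X}^*_2(j)} I(X_t = y) \Big] \tag{22} \\
& - \sum_{y \in S^*} r^*_y \pi^*_y E_\gamma\Big[ \sum_{j=1}^{b^*(n)} |\bar{X}^*_2(j)| \Big] \tag{23} \\
& + \sum_{y \in S^*} r^*_y E_\gamma\Big[ \sum_{j=1}^{b^*(n)} \sum_{X_t \in \bar{X}^*_1(j)} I(X_t = y) \Big] \tag{24} \\
& - \sum_{y \in S^*} r^*_y \pi^*_y E_\gamma\Big[ \sum_{j=1}^{b^*(n)} |\bar{X}^*_1(j)| \Big] \tag{25} \\
\ge{} & -\mu^* M^*_{\max} \sum_{i: \mu^i < \mu^*} E_\gamma[B^i(b(n))] , \tag{26}
\end{align*}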



where the last inequality is obtained by noting the difference 
between (l22l and (1231 is zero by Lemma [2] using positivity 
of rewards to lower bound (l24l i by 0, and (f2TT > to upper bound 
(l25l i. Combine this with ([Tol l and (f20]l we can thus obtain a 
logarithmic upper bound on — Z 1 {n). Finally, we have 



Upper Confidence Bound (UCB1): 

Initialize: n = 1 
for n ≤ K do 
    play arm n; n := n + 1 
end for 
while n > K do 
    $\bar{r}^i(T^i(n)) = \frac{r^i(1) + r^i(2) + \cdots + r^i(T^i(n))}{T^i(n)}$, ∀i 
    calculate index: $g^i_{n,T^i(n)} = \bar{r}^i(T^i(n)) + \sqrt{\frac{L \ln n}{T^i(n)}}$, ∀i 
    play arm j such that $j = \arg\max_i g^i_{n,T^i(n)}$; update $r^j(n)$ and $T^j(n)$ 
    n := n + 1 
end while 

Fig. 4. The UCB1 algorithm. 



^E^n-Tln)] -J5 7 [ 



E 1 

t=T(n)+l 



*(t) 



< 



ie{i,...,/f} 



A/' 



1 



(27) 



Therefore we have obtained the stated logarithmic bound for 
(18). Note that this bound does not depend on γ, and therefore 
is also an upper bound for R(n), completing the proof. ■ 
Therefore, given minimal information about the arms, such 
as an upper bound for $S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2/\epsilon_{\min}$, the player can 
guarantee logarithmic regret by choosing an L in the RCA 
algorithm that satisfies the condition in Theorem 2. 

We end this section by noting that the logarithmic bound in 
n is also order optimal for this restless bandit problem, i.e., no 
bound of better order than ln n is possible (although a better constant 
is possible). This follows from the fact that the rested bandit 
problem is a special case of the restless problem, and it has been 
shown that the best order is logarithmic for the rested 
problem. Moreover, we conjecture that the order optimality 
of RCA holds when it is used with any index policy that is 
order optimal for the rested bandit problem. Because of the 
use of regenerative cycles in RCA, the observations used to 
calculate the indices can be in effect treated as coming from 
rested arms. Thus an approach similar to the one in Theorem 
[TJ can be used to prove order optimality. 

V. An Example: Gilbert-Elliot Channel Model 

In this section we simulate RCA and two other algorithms 
under the commonly used Gilbert-Elliot channel model, where 
each channel has two states, good and bad (or 1 and 0, respectively). 
The first algorithm is the upper confidence bound 
(UCB1) algorithm from Q. In (U we have proved that it 
has a logarithmic regret in the case of Markovian rewards 
when all arms are rested, by replacing the constant 2 in 
the index calculation of UCB1 with L and using a result 
from ifTBll . Using Lezaud's bound as we have done in the 
present paper it can be shown that this modified UCB1 
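algorithm, shown in Figure 4, has a logarithmic regret for $L \ge 112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2/\epsilon_{\min}$ for the rested bandit problem. 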
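
As a rough illustration of the modified index rule, the sketch below computes the sample-mean index with the exploration constant L in place of the usual 2 and picks the next arm. The function names `ucb1_indices` and `ucb1_select`, and the choice to keep per-arm reward sums and play counts in plain lists, are assumptions made for this example only.

```python
import math
from typing import List


def ucb1_indices(reward_sums: List[float], plays: List[int], n: int, L: float = 2.0) -> List[float]:
    """Sample mean plus exploration term sqrt(L ln n / T_i) for every arm."""
    return [
        reward_sums[i] / plays[i] + math.sqrt(L * math.log(n) / plays[i])
        for i in range(len(plays))
    ]


def ucb1_select(reward_sums: List[float], plays: List[int], n: int, L: float = 2.0) -> int:
    """Return the arm to play at time n (0-based arm index)."""
    # Initialization phase: play every arm once before using indices.
    for i, t in enumerate(plays):
        if t == 0:
            return i
    idx = ucb1_indices(reward_sums, plays, n, L)
    return max(range(len(idx)), key=idx.__getitem__)


# Arm 1 has been played only once, so its exploration term dominates.
print(ucb1_select([5.0, 0.4], [10, 1], n=11, L=2.0))  # 1
```

A smaller L shrinks the exploration term, so the rule commits to the empirically best arm sooner.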



The second algorithm is an online randomized algorithm 
proposed in (T), referred to as the Exp3 algorithm and shown 
in Figure 5. The main distinction of Exp3 is that it is a randomized 
algorithm: given all past observations, the algorithm's 
current action is the outcome of a random variable. Randomization 
is helpful when rewards from arms are determined 
by an adversary rather than a stochastic process. This is the 
context in which Exp3 is introduced and studied in (TJ. 



Exp3: 

Initialize: select parameter α ∈ (0, 1) and set weights $w^i(1) = 1$, ∀i ∈ {1, 2, ..., K} 
while (1) do 
    at time n compute the probabilities $p^i(n) = (1-\alpha)\frac{w^i(n)}{\sum_{j=1}^{K} w^j(n)} + \frac{\alpha}{K}$ 
    take a random sample of the random variable X(n) with pmf P(X(n) = i) = $p^i(n)$; denote the outcome by a(n) 
    play arm a(n), and get reward $r^{a(n)}$ 
    if a(n) = i then 
        set weight $w^i(n+1) = w^i(n) \exp\left(\frac{\alpha\, r^{a(n)}}{K p^i(n)}\right)$ 
    else 
        $w^i(n+1) = w^i(n)$ 
    end if 
end while 

Fig. 5. The Exp3 algorithm. 
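
One round of the randomized update can be sketched as follows. This follows the standard Exp3 weight update, with an importance-weighted reward in the exponent, which is an assumption about the exact form used in the figure. The names `exp3_probabilities` and `exp3_step`, and the callback `reward_fn`, are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def exp3_probabilities(weights: List[float], alpha: float) -> List[float]:
    """Mix the normalized weights with a uniform distribution."""
    K = len(weights)
    total = sum(weights)
    return [(1.0 - alpha) * w / total + alpha / K for w in weights]


def exp3_step(weights: List[float], alpha: float,
              reward_fn: Callable[[int], float],
              rng: random.Random) -> Tuple[int, float]:
    """Draw an arm, observe its reward, and update only that arm's weight in place."""
    K = len(weights)
    probs = exp3_probabilities(weights, alpha)
    arm = rng.choices(range(K), weights=probs)[0]
    reward = reward_fn(arm)
    # Dividing by the selection probability keeps the reward estimate unbiased.
    weights[arm] *= math.exp(alpha * reward / (K * probs[arm]))
    return arm, reward


rng = random.Random(0)
w = [1.0] * 4
arm, r = exp3_step(w, alpha=0.1, reward_fn=lambda a: 1.0, rng=rng)
```

Because the uniform term α/K keeps every probability strictly positive, the division in the exponent is always well defined.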

We simulate and compare the regret of these three algorithms, 
averaged over 100 runs, under two scenarios, denoted 
S1 and S2, respectively. Each scenario involves 5 two-state 
channels with varying transition probabilities. The statistics 
and rewards used are given in Table I. Exp3 is run with two 
different values of α: $\alpha_1 = 0.1$ and $\alpha_2 = \min\{1, \sqrt{K \ln K / ((e-1)N)}\}$, 
where $N = 10^5$ is the time horizon. All arms are assumed 
to be in stationary distribution at the beginning. The quantity 
$112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2/\epsilon_{\min}$ is equal to 9556 in S1 and 1037.2 in S2. 
TABLE I 
Channel parameters 

S1      p01, p10      S2      p01, p10      r0, r1 
ch.1    .01, .03      ch.1    .1, .2        .1, 1 
ch.2    .04, .01      ch.2    .1, .3        .1, 1 
ch.3    .03, .01      ch.3    .5, .1        .1, 1 
ch.4    .02, .01      ch.4    .1, .4        .1, 1 
ch.5    .01, .02      ch.5    .1, .5        .1, 1 

[Fig. 7 legend: RCA, L=1050; UCB1, L=1050; RCA, L=10; UCB1, L=10; Exp3 with α = α_1 and α = α_2] 
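
To make the table concrete, the sketch below computes the stationary probability of the good state and the resulting mean reward of each channel. It assumes that p01 is the probability of moving from state 0 (bad) to state 1 (good), that p10 is the reverse, and that the rewards r0 = 0.1 and r1 = 1 apply in both scenarios. The helper names are hypothetical.

```python
from typing import List, Tuple


def good_state_probability(p01: float, p10: float) -> float:
    """Stationary probability of state 1 for a two-state Markov chain."""
    return p01 / (p01 + p10)


def mean_reward(p01: float, p10: float, r0: float = 0.1, r1: float = 1.0) -> float:
    """Long-run average reward: stationary distribution weighted by state rewards."""
    pi1 = good_state_probability(p01, p10)
    return (1.0 - pi1) * r0 + pi1 * r1


# (p01, p10) pairs as read from Table I.
SCENARIOS = {
    "S1": [(0.01, 0.03), (0.04, 0.01), (0.03, 0.01), (0.02, 0.01), (0.01, 0.02)],
    "S2": [(0.1, 0.2), (0.1, 0.3), (0.5, 0.1), (0.1, 0.4), (0.1, 0.5)],
}


def best_channel(params: List[Tuple[float, float]]) -> Tuple[int, float]:
    """Return the 1-based index and mean reward of the best channel."""
    means = [mean_reward(p01, p10) for p01, p10 in params]
    best = max(range(len(means)), key=means.__getitem__)
    return best + 1, means[best]


for name, params in SCENARIOS.items():
    ch, mu = best_channel(params)
    print(f"{name}: best channel ch.{ch}, mean reward {mu:.3f}")
```

Under these assumptions the best channel is ch.2 in S1 (mean reward 0.82) and ch.3 in S2 (mean reward 0.85). These are the benchmarks against which regret is measured.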



Fig. 6. Regret under scenario S1. [Legend: RCA, L=10000; UCB1, L=10000; RCA, L=10; UCB1, L=10; Exp3 with α = α_1 and α = α_2] 
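
A single point on a regret curve of this kind could be estimated along the following lines. This is only a sketch under stated assumptions, not the simulation code behind the figures. It runs UCB1 with exploration constant L on restless two-state channels, takes p01 as the bad-to-good and p10 as the good-to-bad transition probability, and measures regret against the best channel's stationary mean reward. The function name `ucb1_regret`, the short horizon, and the number of runs are illustrative choices.

```python
import math
import random
from typing import List, Tuple


def ucb1_regret(channels: List[Tuple[float, float]], rewards: Tuple[float, float],
                horizon: int, L: float, seed: int = 0) -> float:
    """Regret of UCB1 (constant L) on restless two-state channels after `horizon` steps."""
    rng = random.Random(seed)
    K = len(channels)
    pi1 = [p01 / (p01 + p10) for p01, p10 in channels]  # stationary P(good)
    mu_star = max((1 - p) * rewards[0] + p * rewards[1] for p in pi1)
    # Start every channel in its stationary distribution.
    states = [1 if rng.random() < p else 0 for p in pi1]
    sums, plays, total = [0.0] * K, [0] * K, 0.0
    for n in range(1, horizon + 1):
        if n <= K:
            arm = n - 1  # play each arm once
        else:
            arm = max(range(K), key=lambda i: sums[i] / plays[i]
                      + math.sqrt(L * math.log(n) / plays[i]))
        r = rewards[states[arm]]
        sums[arm] += r
        plays[arm] += 1
        total += r
        # Restless arms: every channel evolves, whether it was played or not.
        for i, (p01, p10) in enumerate(channels):
            flip = p01 if states[i] == 0 else p10
            if rng.random() < flip:
                states[i] = 1 - states[i]
    return horizon * mu_star - total


S2 = [(0.1, 0.2), (0.1, 0.3), (0.5, 0.1), (0.1, 0.4), (0.1, 0.5)]
runs = [ucb1_regret(S2, (0.1, 1.0), horizon=2000, L=10, seed=s) for s in range(20)]
avg_regret = sum(runs) / len(runs)
```

Averaging over independent seeds mirrors the averaging over runs described in the text. A faithful reproduction would use the full horizon and run count.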

Results are shown in Figures 6 and 7 under scenarios S1 
and S2, respectively. We make the following observations 
from this set of curves. Firstly, the performance of both RCA and 
UCB1 improves when a smaller value of L is used. 
This suggests that the condition $L \ge 112 S_{\max}^2 r_{\max}^2 \hat{\pi}_{\max}^2/\epsilon_{\min}$ 
is sufficient but not in general necessary for the logarithmic 
regret to hold. Secondly, Exp3 shows good performance when 
α_2, which utilizes the knowledge of the time horizon, is chosen 
as the constant. If the time horizon is not given, then Exp3 
has a linear regret instead as was proven in (TJ. Lastly, 
overall the performance of UCB1 is competitive compared 
to RCA, which has been shown to have logarithmic regret in 
the previous section. In particular, in Figure 6, for L = 10, 
UCB1 outperforms RCA significantly. This is because in this 
case the channels are very bursty, so updating the indices at 
every time step, as in UCB1, is a better option than waiting for 
regenerative cycles to occur, as in RCA, where an update can 
take a long time to occur. These results suggest that there 
may exist logarithmic bounds for UCB1 as well. Furthermore, 
they suggest obvious ways to improve the performance of 
RCA. However, as discussed earlier, due to the restless nature 
of the arms, the problem becomes intractable when the indices 
are updated constantly. It remains an interesting future 
study to show such bounds for UCB1. 

VI. Conclusion 

We considered the OSA problem when the primary users' 
activities are modeled as generic finite-state Markov chains. 
This was formulated as a single-player restless bandit problem. 
We proposed an algorithm that updates the sample mean based 
indices using regenerative sample paths and showed that its 
regret can be upper bounded uniformly and logarithmically 
over time. This is the first result showing that log-regret is 
possible in a restless bandit learning problem. We numerically 
compared the performance of RCA with two other algorithms, 
UCB1 and Exp3, and conjectured that similar logarithmic 
bounds may exist for UCB1 as well. 

Fig. 7. Regret under scenario S2 